Key topics: Time Series Analysis, Forecasting, Spatial Data, R Shiny Dashboard
This report contains an exploratory data analysis of cyclist violations in NYC from 2018 to June 30, 2023, as well as a forecast of cyclist counts through 2025. Everything was coded in R Markdown files, which can be viewed in the GitHub repo. A supplemental R Shiny Dashboard contains selectable maps and time frames for all violations.
While one can find a good deal of information about cycling in NYC in general, it is more difficult to find information specifically about bike violations (also referred to as ‘tickets’ or ‘summonses’). This report examines some general facts obtained from the available data about cycling violations in NYC, such as the most common violations, changes in violations over time, and differences between boroughs. It also examines cyclist count data, looking at historic seasonal trends and forecasting future ridership.
Throughout this report, I will use the term ‘cyclist’ to refer to regular bike riders, e-bike riders, and e-scooter riders, and ‘bicycle’ to refer to any of the vehicles that fall under these labels.
The datasets used were all taken from the NYC Open Data website, except the violation code description data, which was taken from the NY DMV website. The violation data was merged from two different datasets: a ‘year to date’ dataset spanning Jan 1, 2023 to June 30, 2023, and a ‘historical’ dataset spanning Jan 1, 2018 to December 31, 2022. These datasets contained all violations issued by the NYPD, but only the violations labeled ‘Bike’, ‘Ebike’, and ‘Escooter’ were kept for this report. The bicycle count data was aggregated per day and per week, and these values were appended to the merged violations dataset. The textual descriptions of the violation codes were also merged into the violations dataset, so that a description of each violation code could easily be accessed.
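The merge described above can be sketched in base R as follows. This is only an illustration on toy data: the data frame names, the column names (`veh_category` in particular), and the `OTHER1`/`OTHER2` codes are assumptions, not the names used in the actual repo or datasets.

```r
# Toy stand-ins for the two NYPD extracts (hypothetical column names and rows).
historical <- data.frame(
  violation_date = as.Date(c("2018-01-05", "2022-12-31", "2021-06-01")),
  violation_code = c("1111D1C", "1236B", "OTHER1"),
  veh_category   = c("Bike", "Ebike", "Car"),
  stringsAsFactors = FALSE
)
year_to_date <- data.frame(
  violation_date = as.Date(c("2023-01-02", "2023-06-30")),
  violation_code = c("1111D1C", "OTHER2"),
  veh_category   = c("Escooter", "Truck"),
  stringsAsFactors = FALSE
)
# Toy stand-in for the DMV code description table.
code_descriptions <- data.frame(
  violation_code = c("1111D1C", "1236B"),
  description    = c("BICYCLE OR SKATEBOARD FAILED TO STOP AT RED LIGHT- NYC",
                     "NO BELL OR SIGNAL DEVICE ON BICYCLE"),
  stringsAsFactors = FALSE
)

# 1. Stack the historical and year-to-date periods into one table.
all_violations <- rbind(historical, year_to_date)

# 2. Keep only the cyclist categories used in this report.
cyclist_labels <- c("Bike", "Ebike", "Escooter")
cyclist_violations <-
  all_violations[all_violations$veh_category %in% cyclist_labels, ]

# 3. Left-join the textual description onto each violation code.
cyclist_violations <- merge(cyclist_violations, code_descriptions,
                            by = "violation_code", all.x = TRUE)

print(cyclist_violations)
```

Using `all.x = TRUE` keeps violations whose code has no matching description, so a gap in the description table does not silently drop rows.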
mapview(na.omit(df_bicycle_counters_boroughs), xcol = "longitude", ycol = "latitude", crs = 4269, grid = FALSE)

Bicycle counter locations throughout the 5 boroughs. Note that two counters listed on Amsterdam Ave, on the west side of Central Park, have exactly the same latitude and longitude in the .csv file. In reality, one is on Amsterdam Ave counting bike traffic in the uptown direction, and the other is one avenue over on Columbus Ave, counting bike traffic flowing downtown.
The bicycle count data was taken from 18 separate bike counters spread throughout the 5 boroughs, as can be seen in the map above. Although a total of 29 were listed in the .csv file, some of those were either duplicates or not applicable (for example, they only counted pedestrians) and were removed from consideration. The majority of these counters could be considered ‘Manhattan-centric.’ For instance, there is only one counter in Staten Island, by the ferry access to Manhattan; one in a single area of the South Bronx; and only three in Queens, one of which is placed near the Queensboro Bridge to Manhattan. Because of this unequal representation, I chose not to perform any borough-specific analyses involving rider count data (the violation data, however, does not depend on bike counters), and any generalizations or takeaways from this report should keep these limitations in mind. Each counter records the number of cyclists that cross it, aggregated in 15-minute intervals. The bike counter data ranges from 2012 to the present, but only the data from 2018 onward was used.
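As a rough illustration of how 15-minute counter readings roll up into the daily totals used later in this report, here is a minimal base R sketch. The counter IDs, column names, and count values are made up for the example.

```r
# Hypothetical 15-minute readings from two counters (made-up values).
readings <- data.frame(
  counter_id = c("A", "A", "A", "B", "B"),
  timestamp  = as.POSIXct(c("2022-05-01 08:00:00", "2022-05-01 08:15:00",
                            "2022-05-02 08:00:00", "2022-05-01 08:00:00",
                            "2022-05-02 08:15:00"),
                          tz = "America/New_York"),
  count      = c(12, 15, 9, 20, 7)
)

# Derive the calendar day of each reading in local time.
readings$date <- as.Date(readings$timestamp, tz = "America/New_York")

# Sum all readings from all counters for each day.
daily_total <- aggregate(count ~ date, data = readings, FUN = sum)
names(daily_total)[2] <- "daily_total_cyclists"

print(daily_total)
```

Summing across counters gives one city-wide figure per day, which is why the uneven counter placement noted above matters when interpreting the totals.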
Of the 126,812 initial violation entries in the specified time frame, 111 were removed, leaving 126,701. Sixteen rows did not contain violation codes, 92 rows did not contain city name or location information, and 3 rows did not contain location information.
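The row counts above can be checked with a few lines of R. The second half is a sketch of one way such rows could be dropped, on a toy data frame with assumed column names; it is not necessarily how the cleaning was done in the repo.

```r
# Row counts reported in the text.
initial_rows <- 126812
removed_rows <- c(missing_violation_code   = 16,
                  missing_city_or_location = 92,
                  missing_location         = 3)

total_removed  <- sum(removed_rows)
remaining_rows <- initial_rows - total_removed
print(c(removed = total_removed, remaining = remaining_rows))

# Toy example (hypothetical columns): keep only rows with no missing fields.
toy <- data.frame(
  violation_code = c("1111D1C", NA, "1236B", "1110AB"),
  city_nm        = c("MANHATTAN", "BROOKLYN", NA, "QUEENS"),
  latitude       = c(40.71, 40.68, 40.75, NA),
  stringsAsFactors = FALSE
)
toy_clean <- toy[complete.cases(toy), ]
print(toy_clean)
```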
ts_daily_total |>
  ggplot(aes(x = daily_total_cyclists)) +
  geom_histogram(bins = 80) +
  labs(title = 'Daily Total Cyclists Histogram') +
  xlab('Cyclists Per Day') +
  ylab('Count')

df_bike_violations |>
  mutate(violation_date = date(violation_date)) |>
  arrange(desc(daily_total_cyclists)) |>
  distinct(violation_date, daily_total_cyclists) |>
  slice(1:10) |>
  kable(caption = "Top 10 Busiest Cycling Days",
        col.names = linebreak(c('Violation Date', 'Daily Total Cyclists'), align = "l")) |>
  kable_classic(full_width = F, html_font = "Cambria", position = "left")

| Violation Date | Daily Total Cyclists |
|---|---|
| 2019-10-30 | 59750 |
| 2019-11-04 | 59291 |
| 2019-11-05 | 58329 |
| 2020-09-12 | 57637 |
| 2019-11-27 | 57456 |
| 2019-07-16 | 57394 |
| 2019-11-06 | 57384 |
| 2019-11-26 | 57323 |
| 2020-06-13 | 57195 |
| 2020-11-07 | 56453 |
The total number of cyclists per day ranged from 1,665 to almost 60,000. Seven of the ten busiest days fell between late October and late November (six in 2019 and one in 2020); the remaining three were in June, July, and September. Further investigation could help uncover why this is, but my guess is that people might be trying to squeeze in one last ride before winter.
ts_daily_total |>
  autoplot(daily_total_cyclists) +
  geom_smooth(method = "lm") +
  ggtitle('Daily Total Cyclists Time Plot') +
  xlab('Violation Date') +
  ylab('Daily Total Cyclists')

From the plot of daily total cyclists, a general increasing trend can be seen, along with strong seasonality: there are more cyclists in the warmer summer months than in the colder months. The slope of the regression line is ~6.54, which indicates that the daily total grows by about 6.5 cyclists per day on average.
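To make the units of that slope concrete, here is a small sketch that fits the same kind of linear trend to a synthetic series. The series is invented, with a built-in trend of 6.5 cyclists per day, and it assumes the time axis is measured in days, which is how R stores `Date` values numerically. Only the 6.54 figure comes from the report.

```r
# Synthetic, noise-free daily series with a known trend of 6.5 cyclists/day.
dates     <- seq(as.Date("2018-01-01"), as.Date("2018-12-31"), by = "day")
day_index <- as.numeric(dates - dates[1])   # 0, 1, 2, ... days since the start
daily_total_cyclists <- 20000 + 6.5 * day_index

# Ordinary least squares trend line, as drawn by geom_smooth(method = "lm").
fit <- lm(daily_total_cyclists ~ day_index)
slope_per_day <- coef(fit)[["day_index"]]
print(slope_per_day)

# Converting the report's slope of ~6.54 per day into a yearly change.
reported_slope <- 6.54
yearly_change  <- reported_slope * 365
print(yearly_change)
```

Under that assumption, a slope of 6.54 per day corresponds to a typical day seeing roughly 2,400 more counted cyclists than the same day a year earlier.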
Let’s further explore the seasonality. It’s easier to view this with monthly totals:
ts_monthly_total |>
  gg_season(monthly_total_cyclists) +
  ggtitle('Seasonal plot of total monthly ridership') +
  xlab('Violation Month') +
  ylab('Monthly Total Cyclists')

ts_monthly_total |>
  gg_subseries(monthly_total_cyclists) +
  ggtitle('Subseries plot of total monthly ridership') +
  xlab('Violation Month') +
  ylab('Monthly Total Cyclists')

The increase in ridership in the warmer months can clearly be seen. Also note the yearly increase, as well as the dip in April 2020 during the COVID lockdown.
# Note: Count values were copied and pasted from 2_NYPD_Bike_viol_EDA.rmd. If that file is
# updated, make sure to update these values.
location <- c('Williamsburg side of Williamsburg bridge',
              'Queens side of Queensboro Bridge',
              'Manhattan side of Manhattan bridge',
              'Kent ave in Williamsburg',
              'Brooklyn Bridge')
count_2022 <- c(1964902, 1818163, 1584788, 1066870, 1002070)
df_busiest_counters <- data.frame(location, count_2022)

df_busiest_counters |>
  kable(caption = "Busiest Bike Counters, 2022",
        col.names = linebreak(c('Counter Location', 'Total Counted Cyclists, 2022'), align = "l")
  ) |>
  kable_classic(full_width = F, html_font = "Cambria", position = "left") |>
  row_spec(1, bold = T)

| Counter Location | Total Counted Cyclists, 2022 |
|---|---|
| Williamsburg side of Williamsburg bridge | 1964902 |
| Queens side of Queensboro Bridge | 1818163 |
| Manhattan side of Manhattan bridge | 1584788 |
| Kent ave in Williamsburg | 1066870 |
| Brooklyn Bridge | 1002070 |
The busiest bike counters in 2022 were at bridge access points as well as Kent Ave. in Williamsburg, which feeds onto the Williamsburg bridge. For 2023 (through June 30th), the order is mostly the same, with Kent Ave. and Brooklyn Bridge switching places. An analysis of daily total count data from just the Williamsburg bridge counter revealed similar seasonal trends to the total count data from all counters.
ts_daily_total |>
  autoplot(daily_total_violations) +
  ggtitle("Total Daily Violations") +
  xlab('Violation Date') +
  ylab('Violation Count')

There were 126,701 total violations, spanning 188 different violation types, handed out to cyclists during the time frame this data covers. As we can see from the Total Daily Violations plot, there was a sharp drop-off in daily violations after the COVID lockdown (April 2020). I do not know the reason for this; more information would be needed. Some seasonality is present as well, although it is not as pronounced as in the count data. Fewer violations were handed out around the Christmas and New Year’s holidays, while more were handed out in the late summer and early autumn.
df_bike_violations |>
  mutate(year = year(violation_date)) |>
  ggplot(aes(x = year, fill = city_nm)) +
  geom_bar() +
  labs(title = "Yearly Cyclist Violations",
       subtitle = 'Through June 30, 2023',
       fill = "Borough") +
  scale_fill_brewer() +
  xlab('Year') +
  ylab('Violations')

Per borough, we can see that Manhattan had the most violations, with Brooklyn coming in second. More information is required to determine why. It’s possible that the police presence in Manhattan is higher. It is also possible that there are far more cyclists in Manhattan than in the other boroughs, although this cannot be determined from the current data because of the uneven placement of bike counters.
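The stacked bars are simply counts of violations per year and borough. A minimal base R sketch of that tabulation, on a handful of invented rows that reuse the `violation_date` and `city_nm` column names, might look like this:

```r
# Invented violation rows; only the column names mirror the report's dataset.
violations <- data.frame(
  violation_date = as.Date(c("2019-03-01", "2019-07-04", "2020-05-10",
                             "2020-05-11", "2023-06-30")),
  city_nm        = c("MANHATTAN", "BROOKLYN", "MANHATTAN",
                     "MANHATTAN", "QUEENS"),
  stringsAsFactors = FALSE
)

# Extract the calendar year, then count rows per (year, borough) pair.
violations$year <- as.integer(format(violations$violation_date, "%Y"))
counts <- as.data.frame(
  table(year = violations$year, borough = violations$city_nm),
  stringsAsFactors = FALSE
)

# Show only the combinations that actually occur.
print(counts[counts$Freq > 0, ])
```

Keep in mind that the 2023 bar covers only half a year, so it is not directly comparable with the full-year bars.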
df_bike_violations |>
  mutate(violation_date = date(violation_date)) |>
  # n() already returns the row count per group, so no sum() is needed
  summarise(total = n(), .by = c(violation_code, description)) |>
  arrange(desc(total)) |>
  # percentage of all violations, rounded to one decimal place
  mutate(percent = round(100 * total / sum(total), 1)) |>
  slice(1:10) |>
  kable(caption = 'Top 10 Violations',
        col.names = linebreak(c('Violation Code', 'Description', 'Total', 'Percent'))) |>
  kable_classic(full_width = F, html_font = "Cambria", position = "left")

| Violation Code | Description | Total | Percent |
|---|---|---|---|
| 1111D1C | BICYCLE OR SKATEBOARD FAILED TO STOP AT RED LIGHT- NYC | 55933 | 44.1 |
| 1110AB | DISOBEYED TRAFFIC DEVICE WHILE OPERATING BICYCLE | 17166 | 13.5 |
| 1127AB | DRIVING WRONG DIRECTION ON ONE-WAY STREET - BICYCLE | 7995 | 6.3 |
| 403A3IX | BICYCLE FAILED TO YIELD TO VEHICLE/PEDESTRIAN AT RED LIGHT- NYC | 6682 | 5.3 |
| 37524AB | OPER BICYCLE WITH MORE 1 EARPHONE | 6052 | 4.8 |
| 1236B | NO BELL OR SIGNAL DEVICE ON BICYCLE | 3959 | 3.1 |
| 412P1 | BIKING OFF LANE- NYC | 3673 | 2.9 |
| 407C31 | BIKE/SKATE ON SIDEWALK-NYC | 3459 | 2.7 |
| 1232A | IMPROPER OPERATION OF BICYCLE | 2449 | 1.9 |
| 1111D1N | NYC REDLIGHT | 1925 | 1.5 |
From the Top 10 Violations table, we can see that improper traffic behavior at red lights (codes 1111D1C and 403A3IX) accounts for ~50% of all cyclist violations. Other common violations include riding in the wrong direction on a one-way street, operating a bicycle with more than one earphone, not having a bell, and riding on the sidewalk.
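The ~50% figure can be reproduced directly from the totals in the table. The counts below are copied from the Top 10 Violations table and the overall total from earlier in the report; the variable names are my own.

```r
# Totals taken from the Top 10 Violations table above.
total_violations <- 126701
red_light_counts <- c("1111D1C" = 55933,  # failed to stop at red light
                      "403A3IX" = 6682)   # failed to yield at red light

# Share of each red-light code, and their combined share, in percent.
share_each     <- round(100 * red_light_counts / total_violations, 1)
share_combined <- round(100 * sum(red_light_counts) / total_violations, 1)

print(share_each)
print(share_combined)
```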